[SPARK-25044][SQL] (take 2) Address translation of LMF closure primitive args to Object in Scala 2.12 #22259
Conversation
@cloud-fan you might be interested in this. This is a simpler version that doesn't use udfInternal. This is really just about the nullability issue. It leaves the current behavior that schema inference failure is quietly ignored.

Test build #95387 has finished for PR 22259 at commit

LGTM.

Test build #4297 has finished for PR 22259 at commit
    inputsNullCheck
      .map(If(_, Literal.create(null, udf.dataType), udf.copy(children = newInputs)))
      .getOrElse(udf)

case udf@ScalaUDF(func, _, inputs, _, _, _, _, nullableTypes) =>
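For readers following along, here is a minimal, self-contained sketch of what this rewrite does, using hypothetical stand-in types rather than the actual Catalyst classes: the UDF call is guarded by a null check built only from the inputs whose Scala types are primitive (non-nullable).

```scala
object NullCheckSketch {
  // Hypothetical stand-ins for Catalyst expressions (not Spark's classes).
  sealed trait Expr
  case class Input(name: String) extends Expr
  case class Udf(inputs: Seq[Expr], nullableTypes: Seq[Boolean]) extends Expr
  case class If(predicate: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr
  case object NullLiteral extends Expr

  // Mirrors the diff above: OR together IsNull checks for every input whose
  // type is not nullable, then guard the UDF so a null primitive input
  // produces null instead of a garbage unboxed value.
  def handleNullInputs(udf: Udf): Expr = {
    val checks = udf.inputs.zip(udf.nullableTypes).collect {
      case (input, nullable) if !nullable => IsNull(input): Expr
    }
    checks.reduceOption(Or(_, _))
      .map(If(_, NullLiteral, udf))
      .getOrElse(udf)
  }
}
```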
nit: can we restore the spaces as in the original?
Ah, missed it. We can clean it up later.
I might be missing it: what is the space issue? There's an additional level of indent because of the if statement.
It is in `udf@ScalaUDF`, which should have been `udf @ ScalaUDF`.
Oh right. Yeah I didn't mean to change that. It's minor enough to leave I think. (or else standardize across the code)
Thanks, merging to master!
@@ -47,7 +48,8 @@ case class ScalaUDF(
     inputTypes: Seq[DataType] = Nil,
     udfName: Option[String] = None,
     nullable: Boolean = true,
-    udfDeterministic: Boolean = true)
+    udfDeterministic: Boolean = true,
+    nullableTypes: Seq[Boolean] = Nil)
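For illustration, a hedged sketch of constructing the expression with the new field. The argument names follow the diff above; this is Spark-internal API whose exact shape may differ across versions. For a UDF over `(Int, String)`, the `Int` slot is primitive, so its entry in `nullableTypes` is `false`.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, ScalaUDF}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Assumed constructor shape, per the diff above (internal API).
val expr = ScalaUDF(
  function = (x: Int, s: String) => s * x,
  dataType = StringType,
  children = Seq(Literal(3), Literal("ab")),
  inputTypes = Seq(IntegerType, StringType),
  nullableTypes = Seq(false, true) // false = primitive slot, needs a null check
)
```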
Using Nil as the default value is dangerous. We don't even have an assert to ensure it is set, so we could easily miss setting it with the right values.
We put the assert in the rule `HandleNullInputsForUDF`. Can we merge `inputTypes` and `nullableTypes` here so that we don't need to worry about it any more? cc @srowen
The logic here was again that we wanted to avoid changing the binary signature. I know catalyst is effectively private to Spark, but this wasn't marked specifically private; I wondered if it would actually affect callers? If not we can go back and merge it.
Nil is just an empty list; I don't think it's dangerous, and it is used above in `inputTypes`. It is not always set, because it's not always possible to infer the schema, let alone nullability.
I think the problem is more about the way we handle `nullableTypes` when it is not specified, as in https://github.com/apache/spark/pull/22259/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2157. The test failure in https://github.com/apache/spark/pull/21851/files#diff-e8dddba2915a147970654aa93bee30a7R344 would have been exposed if `nullableTypes` had been updated in this PR. So I would say logically this parameter is required, but right now it is declared optional. In this case, things went wrong when `nullableTypes` was left unspecified, and this could happen not only in tests but in "source" too. I suggest we move this parameter up, right after `inputTypes`, so it can get the attention it needs.
I see, you are saying that some UDF needed to declare nullable types but didn't? I made the param optional to try to make migration easier and avoid changing the signature much. But doesn't the test you point to pass? Are you saying it should not pass?
Yeah, I could see an argument that it need not block the release. The functionality works as intended, at least.
Would you change it again in 2.4.1? If not, then we decide to just keep this behavior. Let's at least get this in if there is a new RC.
I think it's mostly about maintainability. We should definitely update the code as @maryannxue said, so that `ScalaUDF` is easier to use and less error-prone. I feel we don't need to backport it, as it's basically a code refactor.
@maryannxue Please submit a PR to make the parameter `nullableTypes` required.
Yeah looks like we should just make these changes after all, and for 2.4, as we need another RC.
## What changes were proposed in this pull request?

This is a followup of #22259. A Scala case class has a wide surface: apply, unapply, accessors, copy, etc. In #22259, we changed the type of `UserDefinedFunction.inputTypes` from `Option[Seq[DataType]]` to `Option[Seq[Schema]]`. This breaks backward compatibility. This PR changes the type back and uses a `var` to keep the new nullable info.

## How was this patch tested?

N/A

Closes #22319 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
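A hedged sketch of the `var` trick described in the commit message above (names and field types are simplified assumptions, not the exact Spark code): the case class keeps its original constructor, so `apply`, `unapply`, and `copy` retain their binary signatures, while the new nullability info rides along in a mutable field set after construction.

```scala
// Simplified stand-in for UserDefinedFunction (hypothetical field types).
case class UserDefinedFunctionSketch(
    f: AnyRef,
    inputTypes: Option[Seq[String]]) {
  // New info lives outside the constructor, invisible to apply/unapply/copy,
  // so the case class's public surface is unchanged.
  var nullableTypes: Option[Seq[Boolean]] = None
}

// Construct as before, then attach the nullability info separately.
val udf0 = UserDefinedFunctionSketch((x: Int) => x + 1, Some(Seq("int")))
udf0.nullableTypes = Some(Seq(false))
```

One trade-off of this pattern: `copy` creates a new instance from the constructor arguments only, so the `var` does not carry over automatically.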
## What changes were proposed in this pull request?

This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should indeed be required: otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`.

#22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored to the original behavior. The only exception is documented in the Spark SQL migration guide.

In addition, now that we have this extra field indicating whether a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied repeatedly.

## How was this patch tested?

Added a UT in UDFSuite. Passed affected existing UTs: AnalysisSuite, UDFSuite.

Closes #22732 from maryannxue/spark-25044-followup.

Lead-authored-by: maryannxue <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e816776)
Signed-off-by: Wenchen Fan <[email protected]>
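The behavior difference called out in the commit message above can be illustrated end to end. This is a minimal sketch (session setup, app name, and column names are assumptions, not from the PR): with the nullable-type info recorded at registration, a null fed to a primitive `Int` parameter yields `null` rather than being unboxed to `0`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfNullDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("udf-null-demo")
    .getOrCreate()
  import spark.implicits._

  // UDF over a primitive Int: Spark should guard the call with a null check.
  val plusOne = udf((x: Int) => x + 1)

  // One non-null and one null INT value.
  val df = spark.range(2)
    .selectExpr("IF(id = 0, CAST(1 AS INT), CAST(NULL AS INT)) AS x")

  // Expected: 2 for the non-null row, null for the null row (not 1).
  df.select(plusOne($"x").as("y")).show()

  spark.stop()
}
```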
What changes were proposed in this pull request?
Alternative take on #22063 that does not introduce `udfInternal`.
Resolves the issue with inferring function types in Scala 2.12 by instead using information captured when the UDF is registered, namely which argument types are nullable (i.e. not primitive).
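A minimal sketch of the idea (a simplification; Spark's actual implementation goes through `ScalaReflection` schema inference): at registration the parameter types are still statically known, so we can record which ones are primitives before the closure is erased, in 2.12, to a lambda-metafactory function taking `Object` arguments.

```scala
import scala.reflect.runtime.universe._

// A type is treated as nullable unless it is an AnyVal (primitive-like).
def isNullable[T: TypeTag]: Boolean = !(typeOf[T] <:< typeOf[AnyVal])

// Hypothetical 2-arg registration helper recording per-argument nullability.
def nullableTypesOf[A1: TypeTag, A2: TypeTag, R](f: (A1, A2) => R): Seq[Boolean] =
  Seq(isNullable[A1], isNullable[A2])

// Example: Int is primitive (not nullable), String is nullable.
val result = nullableTypesOf((x: Int, s: String) => s * x)
// result == Seq(false, true)
```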
How was this patch tested?
Existing tests.